Update: Understanding GEE

Ward B. Eiling

2024-09-26

Background: GEE vs. MLM

McNeish et al. (2016) provide a nice summary of the differences.

Parameter interpretation

  • GEE: population-averaged/marginal, where \(E(Y|X) = X \beta\).
    • for a one-unit change in \(X\), the outcome variable \(Y\) is expected to change by the value of regression coefficient \(\beta\), holding all other predictors constant.
  • MLM: subject-specific/conditional, where \(E(Y|X, u) = X \beta + Z u\).
    • for a one-unit change in \(X\), the outcome variable \(Y\) is expected to change by the value of regression coefficient \(\beta\), holding all other predictors constant and given equal values for the random effects.
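Because the models here are linear with an identity link, the marginal and conditional coefficients coincide: averaging the conditional mean over the (mean-zero) random effects leaves the fixed part unchanged,

\[E(Y \mid X) = E_u\big[E(Y \mid X, u)\big] = X\beta + Z\,E(u) = X\beta.\]

For nonlinear links (e.g., logistic), the two interpretations generally yield different coefficient values, which is why the GEE/MLM distinction matters most for non-continuous outcomes.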

Background: GEE estimation

Added comments:

  • Before step 1, we may want to explore the observed correlation matrix.
  • In step 2, we initially fit a GLM (which assumes independence)

Example 1: Simulation Specification

To home in on an understanding of the estimation approaches, we keep the following stable across simulations:

  • Continuous variables (predictors, covariates and outcome)
  • No missing data
  • Number of classes = 50 (recommendation of McNeish et al., 2016)
  • Size of classes (i.e., number of timepoints) = 5 (may be too small for random effects of MLM)
  • Causal structure (\(X \to Y_t\))

And we want to vary:

  • MLM - Estimation approaches: ML and REML
  • GEE - Working correlation matrix (independence, exchangeable, AR-M, unstructured)

Example: DAG

Possible research question

  • How do individual differences in sleep problems influence anxiety levels over time?

Let’s draw a DAG where we are estimating the effect of \(X\) (sleep-problems) on \(Y_t\) (anxiety)

graph TD
    X[sleep-problems] --> Y_t[anxiety_t]

Example: Structural Causal Model

Let’s specify the structural causal model (SCM) for the drawn DAG

\[X_i := \mathcal{N}(5, 1) \quad \text{for} \quad i = 1, \ldots, n_{\text{individuals}}\]

\[Y_{it} := 2 + 0.6 \cdot X_i + \epsilon_{it} \quad \text{for} \quad i = 1, \ldots, n_{\text{individuals}} \quad \text{and} \quad t = 1, \ldots, n_{\text{measurements}}\]

where \(\epsilon_{it}\) follows a multivariate normal distribution with mean zero and an exchangeable correlation structure across time points \(t\):

\[(\epsilon_{i,1}, \ldots, \epsilon_{i,n_{\text{measurements}}})^\top \sim \mathcal{N}(\mathbf{0}, \Sigma)\]

The covariance matrix \(\Sigma\) (equal to the correlation matrix here, since the error variances are 1) has the form:

\[\Sigma = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix}\]

with \(\rho = 0.7\), representing the exchangeable correlation structure between the errors at different time points within an individual.

Simulating the SCM

# Parameters
library(MASS)  # needed for mvrnorm() below
set.seed(123)  # For reproducibility
n_individuals <- 100  # Number of individuals
n_measurements <- 5   # Number of measurements per individual
rho <- 0.7  # Correlation between measurements within an individual

# Generate individual-level covariates (X)
X <- rnorm(n_individuals, mean = 5, sd = 1)

# Correlation matrix (exchangeable structure)
corr_matrix <- matrix(rho, n_measurements, n_measurements)
diag(corr_matrix) <- 1  # Diagonal is 1

# Simulate repeated measures for each individual
ex1.data <- data.frame()
for (i in 1:n_individuals) {
  # Generate errors with the working correlation structure
  epsilon_Y <- mvrnorm(1, mu = rep(0, n_measurements), Sigma = corr_matrix)
  
  # Generate Y for this individual across measurement occasions
  Y <- 2 + 0.6 * X[i] + epsilon_Y
  
  # Store the data
  individual_data <- data.frame(
    id = i,
    occasion = 1:n_measurements,
    X = X[i],
    Y = Y
  )
  
  ex1.data <- rbind(ex1.data, individual_data)
}

# Recode occasion so the first timepoint is zero (for intercept interpretation)
ex1.data$occasion <- ex1.data$occasion - 1

# Check the first few rows of the data
head(ex1.data, 10)
   id occasion        X        Y
1   1        0 4.439524 5.320932
2   1        1 4.439524 4.794407
3   1        2 4.439524 5.276651
4   1        3 4.439524 5.486515
5   1        4 4.439524 5.536658
6   2        0 4.769823 4.545282
7   2        1 4.769823 5.312338
8   2        2 4.769823 4.481625
9   2        3 4.769823 5.706999
10  2        4 4.769823 4.459495

Step 0: Define statistical model equations

The MLM and GEE model equations share (1) outcome \(Y_{it}\), (2) predictor \(X_i\) and (3) time-varying predictor \(T_{it}\).

MLM model equations (including random effects)

  • \[Y_{it} = \beta_{0i} + \beta_{1i} T_{it} + e_{it}\]
    • \(\beta_{0i} = \gamma_{00} + \gamma_{01} X_{i} + u_{0i}\)
    • \(\beta_{1i} = \gamma_{10} + \gamma_{11} X_i + u_{1i}\)

where:

  • \(\beta_{0i}\) is the random intercept
    • \(\gamma_{00}\) and \(\gamma_{01}\) are the intercept and slope to predict \(\beta_{0i}\) from \(X_i\)
  • \(\beta_{1i}\) is the random slope
    • \(\gamma_{10}\) and \(\gamma_{11}\) are the intercept and slope to predict \(\beta_{1i}\) from \(X_i\)
  • \(e_{it} \sim \mathcal{N}(0,\sigma_e^2)\) is the residual error term at level 1
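Substituting the level-2 equations into the level-1 equation gives the combined (reduced-form) model, which is easier to compare with the single-equation GEE specification:

\[Y_{it} = \gamma_{00} + \gamma_{01} X_i + \gamma_{10} T_{it} + \gamma_{11} X_i T_{it} + u_{0i} + u_{1i} T_{it} + e_{it}\]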

GEE model equations

  • \[Y_{it} = \beta_0 + \beta_1 X_i + \beta_2 T_{it} + e_{it}\]

where:

  • \(\beta_0\) is the overall intercept
  • \(\beta_1\) captures the fixed effect of \(X_i\)
  • \(\beta_2\) captures the fixed effect of \(T_{it}\)
  • \(e_{it}\) is the residual error term, modeled with a specified working correlation structure \(R\)

Step 1: Specify the working correlation matrix \(R\)

Let’s explore the observed correlation matrix first

library(tidyr)  # for pivot_wider()
library(dplyr)  # for the pipe and select()

data_wide <- ex1.data %>%
  pivot_wider(names_from = occasion, values_from = Y, names_prefix = "Y_")

# Remove the ID column
data_wide_Y <- data_wide %>% dplyr::select(starts_with("Y_"))

# Compute the observed correlation matrix for the measurements across all individuals
observed_corr_matrix <- cor(data_wide_Y, use = "pairwise.complete.obs")

# Display the observed correlation matrix
knitr::kable(observed_corr_matrix, digits = 3)
      Y_0   Y_1   Y_2   Y_3   Y_4
Y_0 1.000 0.775 0.790 0.814 0.755
Y_1 0.775 1.000 0.810 0.822 0.791
Y_2 0.790 0.810 1.000 0.822 0.845
Y_3 0.814 0.822 0.822 1.000 0.813
Y_4 0.755 0.791 0.845 0.813 1.000

Aligning with the simulation, the off-diagonal correlations are roughly constant across all pairs of time points, i.e., approximately exchangeable.
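Note that the observed correlations (≈ 0.8) are somewhat larger than the error correlation \(\rho = 0.7\): the time-constant predictor \(X_i\) also induces correlation between repeated measures. Marginally,

\[\operatorname{Cor}(Y_{it}, Y_{it'}) = \frac{0.6^2 \operatorname{Var}(X_i) + \rho}{0.6^2 \operatorname{Var}(X_i) + 1} = \frac{0.36 + 0.7}{1.36} \approx 0.78,\]

which matches the observed values closely.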

\(\to\) we specify the working correlation matrix \(R\) to be exchangeable

Step 2: Estimate the regression coefficients as if data were independent

We first estimate the regression coefficients as if the data were independent (ignoring the correlation between measurements within individuals). This step provides initial estimates of \(\beta\).

# Fit a generalized linear model (GLM) assuming independence
# Example using a linear model (gaussian family with identity link)
glm_fit <- glm(Y ~ X + occasion, data = ex1.data, family = gaussian(link = "identity"))

# Display the summary of the GLM fit
coef(glm_fit)
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 

The gee function also prints this initial regression estimate when the model is fit:

# Checking if this aligns with the GEE function
library(gee)  # for gee()
ex1.GEE <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "exchangeable")
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 

This aligns with the fixed effects estimates from the glm function.

Step 3: Estimate the parameters in working correlation matrix \(R\) using the residuals \(e\)

We compute the residuals \(e_{it} = Y_{it} - \hat{Y}_{it}\) from the regression model we just fit and use them to estimate the parameters of \(R\):

# Compute residuals from the GLM model
residuals <- residuals(glm_fit)

# Estimate phi and rho with simplified moment estimators
# (gee() additionally corrects for the number of regression parameters)
res_mat <- matrix(residuals, nrow = n_individuals, byrow = TRUE)  # one row per individual
phi_hat <- mean(residuals^2)
rho_hat <- mean(cor(res_mat)[lower.tri(cor(res_mat))])  # mean off-diagonal correlation
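In formula form, the moment estimators for the exchangeable structure (following Liang & Zeger, 1986) are:

\[\hat{\phi} = \frac{1}{N - p} \sum_{i} \sum_{t} e_{it}^2, \qquad \hat{\rho} = \frac{1}{\hat{\phi}} \, \frac{\sum_{i} \sum_{t < t'} e_{it} e_{it'}}{\sum_{i} \tfrac{1}{2} n_i (n_i - 1) - p}\]

where \(N\) is the total number of observations, \(n_i\) the number of measurements for individual \(i\), and \(p\) the number of regression coefficients.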

Step 4: Use \(R\) to update the outcome variable covariance matrix \(V\)

With the working correlation structure \(R\), we update the working variance-covariance matrix \(V_i\) for individual \(i\):

\[V_i(\alpha) = \phi A_i^{1/2} R_i(\alpha) A_i^{1/2}\]

where \(A_i\) is the diagonal matrix of variances for individual \(i\), \(\alpha\) collects the parameters of the working correlation matrix, and \(\phi\) is a scale parameter.

In the case of a normally distributed outcome with homogeneous variance across time, \(A_i\) is the identity matrix, so:

\[V_i(\alpha) = \phi R_i(\alpha)\]

Step 5: Use \(V\) to update the regression coefficients \(\beta\)
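Given the updated \(V_i\), the regression coefficients are updated with a Fisher-scoring (IRLS-type) step that solves the generalized estimating equations:

\[\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + \left( \sum_{i} D_i^\top V_i^{-1} D_i \right)^{-1} \sum_{i} D_i^\top V_i^{-1} \big( Y_i - \mu_i \big)\]

where \(D_i = \partial \mu_i / \partial \beta\) (here, with the identity link, \(D_i = X_i\)). Steps 3 through 5 are repeated until the estimates converge.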

Step 6: After convergence is reached, apply the sandwich estimator to obtain robust standard errors

After iterating between step 3 and step 5 until convergence is reached, we apply the sandwich estimator to obtain robust standard errors.
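The robust (sandwich) covariance estimator combines a model-based "bread" with an empirical "meat":

\[\widehat{\operatorname{Cov}}(\hat{\beta}) = B^{-1} M B^{-1}, \qquad B = \sum_{i} D_i^\top V_i^{-1} D_i, \qquad M = \sum_{i} D_i^\top V_i^{-1} (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)^\top V_i^{-1} D_i\]

The naive standard errors are based on \(B^{-1}\) alone and are valid only if \(R\) is correctly specified; the sandwich form remains consistent for the fixed effects even when \(R\) is misspecified.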

# The GEE model summary provides robust standard errors using the sandwich estimator
robust_se <- summary(ex1.GEE)$coefficients

# Display the robust standard errors
knitr::kable(robust_se, digits = 3)
            Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)    2.097      0.537   3.902       0.603    3.476
X              0.565      0.104   5.445       0.111    5.106
occasion       0.000      0.017   0.014       0.018    0.013

Note that the robust standard errors are noticeably larger than the naive ones.

Model Comparison

Let’s estimate the effect of \(X\) on \(Y_t\) using MLM (often done with model building)

# Model 1a: Intercept only model, random intercept 
ex1.MLM1a <- lmer(Y ~ 1 + (1 | id), data = ex1.data, REML = FALSE) 

# Model 1b: Intercept only model including the predictor time 
ex1.MLM1b <- lmer(Y ~ 1 + occasion + (1 | id), data = ex1.data, REML = FALSE) 
# anova(ex1.MLM1a, ex1.MLM1b) # Model 1a is not significantly worse than Model 1b

# Model 2: including remainder of the level 1 predictors 
ex1.MLM2 <- lmer(Y ~ 1 + occasion + X + (1 | id), data = ex1.data, REML = FALSE)
# anova(ex1.MLM1b, ex1.MLM2) # Model 1b is significantly worse than Model 2
fixef(ex1.MLM2)
 (Intercept)     occasion            X 
2.0969471228 0.0002371629 0.5647743308 
# Model 3.1: random slope for time (and add covariance)
ex1.MLM3.1 <- lmer(Y ~ 1 + occasion + X + (occasion | id), data = ex1.data, REML = FALSE)
# anova(ex1.MLM2, ex1.MLM3.1) # Model 2 is not significantly worse than Model 3.1

However, we are only really interested in the last model (with random slope and intercept)

ex1.MLM3.1_reml <- lmer(Y ~ 1 + occasion + X + (occasion | id), data = ex1.data, REML = TRUE)
ex1.MLM3.1_mle <- lmer(Y ~ 1 + occasion + X + (occasion | id), data = ex1.data, REML = FALSE)

GEE Models

Note that gee() prints only the initial regression estimate (from the independence GLM) when the model is fit, so the printed values below are identical across working correlation structures; the final coefficients extracted with coef() in the comparison table do differ.

ex1.GEE_exch <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "exchangeable")
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 
ex1.GEE_ind <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "independence")
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 
ex1.GEE_ar <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "AR-M")
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 
ex1.GEE_uns <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "unstructured")
 (Intercept)            X     occasion 
2.0969471228 0.5647743308 0.0002371629 
# ex1.GEE_fixed <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "fixed") # takes very long to converge

Comparison Table

Let’s compare the fixed effects

# Extracting the coefficients
coefs_MLM_reml <- fixef(ex1.MLM3.1_reml)
coefs_MLM_mle <- fixef(ex1.MLM3.1_mle)
# reorder coefficients X and occasion
coefs_MLM_reml <- coefs_MLM_reml[c(1, 3, 2)]
coefs_MLM_mle <- coefs_MLM_mle[c(1, 3, 2)]
# continue with GEE estimates
coef_GEE_exch <- coef(ex1.GEE_exch)
coef_GEE_ind <- coef(ex1.GEE_ind)
coef_GEE_ar <- coef(ex1.GEE_ar)
coef_GEE_uns <- coef(ex1.GEE_uns)

# Creating a data frame with the coefficients
coefs <- data.frame(
  row.names = c("Intercept", "X", "occasion"),
  MLM_REML = coefs_MLM_reml,
  MLM_MLE = coefs_MLM_mle,
  GEE_exch = coef_GEE_exch,
  GEE_ind = coef_GEE_ind,
  GEE_ar = coef_GEE_ar,
  GEE_uns = coef_GEE_uns
)

# Create nice table
knitr::kable(coefs, digits = 3)
          MLM_REML MLM_MLE GEE_exch GEE_ind GEE_ar GEE_uns
Intercept    2.092   2.092    2.097   2.097  2.178   2.069
X            0.566   0.566    0.565   0.565  0.555   0.570
occasion     0.000   0.000    0.000   0.000 -0.006   0.003

Recall that the structural causal model was specified as follows:

\[Y_{it} := 2 + 0.6 \cdot X_i + \epsilon_{it}\]

Observations

  • The fixed effects are similar across models and estimation approaches.
    • Closest to the true effect of \(X\) (0.6): the unstructured GEE model (0.570)
    • Furthest from the true effect: the AR-M GEE model (0.555)
  • Interpretation: for a one-unit change in \(X\), the outcome \(Y\) is expected to change by the value of \(\beta\), holding all other predictors constant and (for MLM only) given equal values for the random effects.

Next steps

  • Writing the ethics application & research proposal
  • Extending this simple example to include covariates
  • Understanding the simulation code by Qian et al. (2020)

Information on Proposal

  • The body of the text should be max. 750 words
  • There is a clear checklist of what it should include
  • My questions
    • What candidate journal for publication?
    • What software/packages will we use?
    • Considering the exploratory nature of the project, can we already formulate a research question or hypothesis?